5ad1005 to f7e897a
src/backend/decayrange.h
Outdated
namespace snmalloc
{
  template<typename Rep>
Given the reuse of the large buddy range rep here, at least a comment (or a concept) might be in order.
src/backend/decayrange.h
Outdated
}

// We have run out of memory.
handle_decay_tick(); // Try to free some memory.
Does this need to be interlocked against the timer firing? I suppose not due to the prepend-only nature of all_local, the read-only nature of the spine traversal, and the use of pop_all for each found sizeclass... assuming that the parent range doesn't need interlocking, which, by default anyway, it doesn't (specifically, the parent will be a CommitRange whose parent is a GlobalRange by default, and CommitRange doesn't actually have state and GlobalRange is an interlock).
The presumption of concurrency-safeness of parent might merit being written down somewhere?
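The argument above rests on two properties: entries are only ever prepended, and consumers detach the whole list at once with pop_all. A minimal sketch of that pattern follows. `Chunk`, `PopAllStack`, and `drain_count` are hypothetical names chosen for illustration and are not taken from the snmalloc sources, whose actual implementation may differ.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical node type, for illustration only.
struct Chunk
{
  Chunk* next = nullptr;
};

// Hypothetical stack that supports only push and pop_all.
class PopAllStack
{
  std::atomic<Chunk*> head{nullptr};

public:
  // Prepend one chunk. Safe to call from several threads at once.
  void push(Chunk* c)
  {
    assert(c != nullptr);
    Chunk* old = head.load(std::memory_order_relaxed);
    do
    {
      c->next = old;
    } while (!head.compare_exchange_weak(
      old, c, std::memory_order_release, std::memory_order_relaxed));
  }

  // Detach the entire list in one atomic step. The caller then owns
  // every node it received and can walk them without further locking.
  Chunk* pop_all()
  {
    return head.exchange(nullptr, std::memory_order_acquire);
  }
};

// Count the nodes in a list that has already been detached.
inline int drain_count(Chunk* list)
{
  int n = 0;
  for (; list != nullptr; list = list->next)
    n++;
  return n;
}
```

Because no caller ever removes a single node, a timer tick and an out-of-memory path can both call `pop_all` concurrently: each node ends up in exactly one of the detached lists.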
There was a problem hiding this comment.
Yeah, I plan to add a static constexpr member to every range type that records whether it is concurrency-safe, as already happens with Align. I just haven't threaded it through yet. GlobalRange would be true, CommitRange would be whatever its parent says, and the buddy range would be false.
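A sketch of how such a property could be threaded through the range templates is shown here. The member name `ConcurrencySafe`, the `EmptyRange` base, and the `static_assert` in `DecayRange` are assumptions made for illustration; the types are empty stand-ins that model only the flag, not any allocation logic.

```cpp
// Hypothetical stand-ins for the range types discussed above.
struct EmptyRange
{
  static constexpr bool ConcurrencySafe = true;
};

// GlobalRange is an interlock, so it is safe whatever its parent is.
template<typename Parent>
struct GlobalRange
{
  static constexpr bool ConcurrencySafe = true;
};

// CommitRange has no state of its own, so it inherits the parent's answer.
template<typename Parent>
struct CommitRange
{
  static constexpr bool ConcurrencySafe = Parent::ConcurrencySafe;
};

// The buddy range keeps unsynchronised state, so it is never safe.
template<typename Parent>
struct LargeBuddyRange
{
  static constexpr bool ConcurrencySafe = false;
};

// Assumption: the decay range checks the property at compile time, since
// its timer may call into the parent concurrently with the owning thread.
template<typename Parent>
struct DecayRange
{
  static_assert(
    Parent::ConcurrencySafe,
    "DecayRange requires a concurrency-safe parent range");
};
```

With this in place, a misconfigured stack such as `DecayRange<LargeBuddyRange<EmptyRange>>` fails to compile as soon as it is instantiated, instead of racing at run time.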
f6254d6 to 93d6e3f
The perf of this is okay, but it increases the memory footprint too much for some examples. I have factored out the primary changes that enable this into #491 so that those can land now; the perf of this can be fixed and landed at a later point.
37a1ce8 to 9694c96
This paper has a really interesting approach to work stealing of chunks between threads: I think we could use some of the ideas in this paper to make the decay range perform better.
BTW, I have recently worked on a Weak AVL (WAVL) tree, which behaves somewhere between an AVL tree and a red-black tree, adapting to the insertion/deletion rate. If data structure performance is a concern, WAVL may be worth a try. Do we require pointer stability for the nodes? If not, a B-tree is almost always faster.
Oh, that is really interesting. We have a lot of constraints on the red-black tree code because it uses the pagemap as the storage for the nodes. This means it can only use 16 bytes per node, and about four bits of that are already reserved.
Then WAVL should be a drop-in solution. There are two variants: one uses one bit to store the rank parity, and the other uses two bits to store rank-difference flags. The second is a little faster, perhaps because it does not need to read a bit in the children to recover the two-bit information.
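To illustrate the one-bit variant within the 16-byte budget, here is a rough sketch. It assumes that bit 0 of the first word is free for the parity, which may not match the bits the pagemap actually has available. `Node` and all helper names are hypothetical. In a WAVL tree every rank difference between a node and its child is 1 or 2, and a missing child has rank -1, so the difference can be recovered from the two parities alone.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical 16-byte node: two 64-bit words, as if held in a pagemap entry.
struct Node
{
  std::uint64_t left_and_parity = 0; // left child address, parity in bit 0
  std::uint64_t right = 0;           // right child address
};
static_assert(sizeof(Node) == 16, "node must fit in 16 bytes");

constexpr std::uint64_t PARITY_BIT = 1;

// Rank parity of a node. A missing child has rank -1, which is odd.
inline bool parity(const Node* n)
{
  return n == nullptr ? true : (n->left_and_parity & PARITY_BIT) != 0;
}

// Promoting or demoting a node by one rank just flips its parity bit.
inline void flip_parity(Node* n)
{
  n->left_and_parity ^= PARITY_BIT;
}

inline Node* left(const Node* n)
{
  return reinterpret_cast<Node*>(
    static_cast<std::uintptr_t>(n->left_and_parity & ~PARITY_BIT));
}

// Store a child pointer while preserving the parity bit.
inline void set_left(Node* n, Node* child)
{
  auto bits =
    static_cast<std::uint64_t>(reinterpret_cast<std::uintptr_t>(child));
  assert((bits & PARITY_BIT) == 0); // relies on node alignment
  n->left_and_parity = bits | (n->left_and_parity & PARITY_BIT);
}

// Rank differences are always 1 or 2, so equal parity means 2.
inline int rank_diff(const Node* parent, const Node* child)
{
  return parity(parent) == parity(child) ? 2 : 1;
}
```

Note that `rank_diff` has to read the child's word as well as the parent's, which is the extra access the two-bit variant avoids.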
This sounds really interesting. Do you have time to experiment with this for snmalloc? If not, would you be happy to submit an issue so we don't lose the idea?
I can give it a try; at the least, we can let an AI agent port it to a benchmark first. Do you have specific instructions for replaying a workload where rbtree consolidation is considered important?
This commit exercises the rbtree pretty heavily:
According to a primitive test (with 4x the iterations of the original test file):
RBTree Replacement Benchmark Report (2026-02-25)
Scope
Package + Remote Run
Environments
Local 10-Run Metric Stats (ns)
Local Hyperfine (10 runs)
Spark (aarch64) 10-Run Metric Stats (ns)
Spark Hyperfine (10 runs)
Notes
While the data structure should be correctly implemented, Codex's code appears very ad hoc, so I may need to craft this change by hand. Given that the 1-bit rank parity approach seems to be the most promising solution, I will retain only that implementation.
Implementation of a range that gradually releases memory back to
the OS. It quickly pulls memory, but the dealloc_range locally caches
the memory and uses Pal timers to release it back to the next level
range when sufficient time has passed.
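
The behaviour described in this commit message can be sketched with a small, single-threaded model. It is only an illustration under stated assumptions: `DecaySketch`, `CountingParent`, the fixed number of epochs, and the exact-size lookup in `alloc_range` are all hypothetical. The Pal timer is replaced by explicit calls to `handle_decay_tick`, and the real range must also deal with concurrency.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical parent range that only counts what is handed back to it.
struct CountingParent
{
  std::size_t returned = 0;
  void dealloc_range(void*, std::size_t)
  {
    returned++;
  }
};

// Hypothetical decay cache: freed blocks wait in per-epoch buckets and
// reach the parent only after `Epochs` ticks without being reused.
template<typename Parent, std::size_t Epochs = 4>
class DecaySketch
{
  struct Block
  {
    void* base;
    std::size_t size;
  };

  Parent& parent;
  std::vector<Block> cache[Epochs];
  std::size_t current = 0;

public:
  explicit DecaySketch(Parent& p) : parent(p) {}

  // Freed memory is cached locally in the bucket for the current epoch.
  void dealloc_range(void* base, std::size_t size)
  {
    assert(base != nullptr);
    cache[current].push_back({base, size});
  }

  // Reuse a cached block of the same size, newest epoch first.
  void* alloc_range(std::size_t size)
  {
    for (std::size_t i = 0; i < Epochs; i++)
    {
      auto& bucket = cache[(current + Epochs - i) % Epochs];
      for (std::size_t j = 0; j < bucket.size(); j++)
      {
        if (bucket[j].size == size)
        {
          void* result = bucket[j].base;
          bucket[j] = bucket.back();
          bucket.pop_back();
          return result;
        }
      }
    }
    return nullptr; // A real range would fall back to the parent here.
  }

  // Stand-in for the timer callback: advance the epoch and flush the
  // bucket being recycled, which holds the oldest cached blocks.
  void handle_decay_tick()
  {
    current = (current + 1) % Epochs;
    for (auto& b : cache[current])
      parent.dealloc_range(b.base, b.size);
    cache[current].clear();
  }
};
```

In this model a block that is freed and not reused reaches the parent on the fourth tick after it was freed, while a block that is reused before then never touches the parent.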